Recognizing semantically similar sentences or paragraphs across languages is beneficial for many tasks, ranging from cross-lingual information retrieval and plagiarism detection to machine translation. Recently proposed methods for predicting cross-lingual semantic similarity of short texts, however, make use of tools and resources (e.g., machine translation systems, syntactic parsers, or named entity recognition) that for many languages (or language pairs) do not exist. In contrast, we propose an unsupervised and very resource-light approach for measuring semantic similarity between texts in different languages. To operate in the bilingual (or multilingual) space, we project continuous word vectors (i.e., word embeddings) from one language to the vector space of the other language via a linear translation model. We then align words according to the similarity of their vectors in the bilingual embedding space and investigate different unsupervised measures of semantic similarity that exploit bilingual embeddings and word alignments. Requiring only a limited-size set of word translation pairs between the languages, the proposed approach is applicable to virtually any pair of languages for which there exists a sufficiently large corpus from which to learn monolingual word embeddings. Experimental results on three different datasets for measuring semantic textual similarity show that our simple resource-light approach reaches performance close to that of supervised and resource-intensive methods, while displaying stability across different language pairs. Furthermore, we evaluate the proposed method on two extrinsic tasks, namely extraction of parallel sentences from comparable corpora and cross-lingual plagiarism detection, and show that it yields performance comparable to that of complex, resource-intensive state-of-the-art models for the respective tasks.
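To make the pipeline concrete, here is a minimal sketch of the three steps the abstract describes: learning a linear projection between embedding spaces from a small seed dictionary, aligning words by cosine similarity in the shared space, and aggregating alignment similarities into a sentence-level score. The toy vocabularies, variable names, and the simple averaging measure are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 50

# Toy monolingual embeddings for two languages; in practice these would be
# pre-trained vectors (e.g., word2vec) learned from large monolingual corpora.
src_vocab = ["hund", "katze", "haus"]
tgt_vocab = ["dog", "cat", "house"]
src_emb = {w: rng.normal(size=dim) for w in src_vocab}
tgt_emb = {w: rng.normal(size=dim) for w in tgt_vocab}

# Step 1: linear translation model. Learn a projection matrix W from a
# limited-size seed dictionary by solving min_W ||XW - Y||_F^2.
seed = [("hund", "dog"), ("katze", "cat")]
X = np.stack([src_emb[s] for s, _ in seed])
Y = np.stack([tgt_emb[t] for _, t in seed])
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def sentence_similarity(src_sent, tgt_sent):
    # Step 2: project each source word into the target space and align it
    # to the most similar target word (greedy best-match alignment).
    best_sims = []
    for s in src_sent:
        projected = src_emb[s] @ W
        best_sims.append(max(cosine(projected, tgt_emb[t]) for t in tgt_sent))
    # Step 3: one simple unsupervised measure, the average similarity of the
    # best word alignments; the paper investigates several such measures.
    return sum(best_sims) / len(best_sims)

print(sentence_similarity(["hund", "haus"], ["dog", "house"]))
```

With real embeddings, the seed dictionary would contain a few thousand translation pairs rather than two, but the least-squares formulation is unchanged.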